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Abstract 

We propose an algorithmic framework for 
convex minimization problems of composite 
functions with two terms: a self-concordant 
part and a possibly nonsmooth regularization 
part. Our method is a new proximal New- 
ton algorithm with local quadratic conver- 
gence rate. As a specific problem instance, 
we consider sparse precision matrix estima- 
tion problems in graph learning. Via a careful 
dual formulation and a novel analytic step- 
size selection, we instantiate an algorithm 
within our framework for graph learning that 
avoids Cholesky decompositions and matrix 
inversions, making it attractive for parallel 
and distributed implementations. 



1. Introduction 

Sparse inverse covariance matrix estimation is a key 
step in graph learning. To understand the setup, let 
us consider learning a Gaussian Markov random field 
(GMRF) of p nodes/ variables from a dataset V := 
{xi, X2, . . . , Xm}, where Xj e 2? is a p-dimensional 
random vector, drawn from the Gaussian distribution 
A/'(/i., S). Let = be the inverse covariance (or 
the precision) matrix for the model. To satisfy the con- 
ditional independencies with respect to the GMRF, 
must have zero in 0.^ corresponding to the absence of 
an edge between nodes i and j (Dempster, 1972). 

To learn the underlying graph structure from , one 
can use the empirical covariance matrix S. Unfortu- 
nately, this approach is fundamentally ill-posed since 
the empirical estimates converge to the true covariance 
at a (l/-y/m)-rate (Dempster, 1972). Hence, inferring 



the true graph structure accurately requires an over- 
whelming number of samples. Unsurprisingly, we usu- 
ally have fewer samples than the ambient dimension, 
compounding the difficulty of estimation. 

While the possible GMRF structures are exponentially 
large, the most interesting graphs are rather simple 
with a sparse set of edges. Provable learning of such 
graphs can be achieved by £i-norm regularization in 
the maximum log-likelihood estimation: 



©*Gargmin|-logdot(©)+tr(S0)+p||vec(©)||i |, 



(1) 



= :/(0) 
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where p > is a parameter to balance the fidelity er- 
ror and the sparsity of the solution and vec is the 
vectorization operator. Here, /(0) corresponds to 
the empirical log-likelihood and g{&) is the sparsity- 
promoting term. Under this setting, the authors in 
(Ravikumar et al., 2011) prove that m = 0{d^ logp) is 
sufficient for correctly estimating the GMRF, where d 
is the graph node-degree. Moreover, the above formu- 
lation still makes sense for learning other graph mod- 
els, such as the Ising model, due to the connection of 
/(0) to the Bregman distance (Banerjee et al., 2008). 

Numerical solution methods for solving problem (1) 
have been widely studied in the literature. For in- 
stance, in (Banerjee et al., 2008; Scheinberg & Rish, 
2009; Scheinberg et al., 2010; Hsieh et al., 2011; 
Rolfs et al, 2012; Olsen et al., 2012) the authors pro- 
posed first order primal and dual approaches to (1) and 
used state-of-the-art structural convex optimization 
techniques such as coordinate descent methods and 
Lasso-based procedures. Alternatively, the authors in 
(Hsieh et al, 2011; Olsen et al., 2012) focused on the 
second order methods and, practically, achieved fast 
methods with a high accuracy. In (Scheinberg et al., 
2010; Yuan, 2012), the authors studied alternating 
direction methods to solve (1), while the work in 
(Li & Toh, 2010) is based on interior point-type meth- 
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ods. Algorithmic approaches where more structm'c is 
known a priori can be found in (Lu, 2010). 

The complexity of the state-of-the-art approaches 
mentioned above is dominated by the Cholesky decom- 
position {0[p^) in general), which currently creates an 
important scalability bottleneck. This decomposition 
appears mandatory since all these approaches employ 
a guess-and-check step-size selection procedures to en- 
sure the iterates remain in the positive definite (PD) 
cone and the inversion of a p x p matrix, whose theo- 
retical cost normally scales with the cost of p x p ma- 
trix multiplications (O(p^) direct, 0(p^'^°^) Strassen, 
and 0(p^ '^''^) Coppersmith- Winograd) . The inversion 
operation is seemingly mandatory in the optimization 
of (1) since the calculation of the descent direction 
V/(0i) := + requires it, and quadratic cost 

approximations to /(©) also need it. Via Cholesky 
decompositions, one can first check if the current solu- 
tion satisfies the PD cone constraint and then recycle 
the decomposition for inversion for the next iteration. 

Contributions: Wc propose a new proximal-Newton 
framework for solving the general problem of (1) by 
only assuming that /(•) is self-concordant. Our algo- 
rithm consists of two phases. In Phase 1, we apply a 
damped proximal Newton scheme with a new, analytic 
step-size selection procedure, and prove that our objec- 
tive function always decreases at least a certain fixed 
amount. As a result, we avoid globalization strategies 
such as backtracking line-search or trust-region proce- 
dures in the existing methods. Moreover, our step-size 
selection is optimal in the sense that it cannot be im- 
proved without additional assumptions on the problem 
structure. In Phase 2, we simply apply the full step 
proximal-Newton iteration as we get into its provable 
quadratic convergence region which we can compute 
explicitly. Moreover, we do not require any additional 
assumption such as the uniform boundedness of the 
Hessian as in (Lee et al., 2012). 

In the context of graph learning, we discuss a specific 
instance of our framework, which avoids Cholesky de- 
compositions and matrix inversions altogether. Hence, 
the per iteration complexity of our approach is domi- 
nated by the cost oipxp matrix multiplications. This 
is because (i) our analytic step-size selection procedure 
ensures the positive definiteness of the iterates doing 
away with global strategies such as line-search which 
demands the objective evaluations (via Cholesky), and 
(m) we avoid calculating the gradient explicitly, and 
hence matrix inversion by a careful dual formulation. 
As a result, our approach is attractive for distributed 
and parallel implementations. 

Paper outline: In Section 2. we first recall some 



fundamental concepts of convex optimization and self- 
concordant functions. Then, we describe the basic op- 
timization set up and show the unique solvability of 
the problem. In Section 3 we outline our algorithmic 
framework and describe its analytical complexity. We 
also deal with the solution of the subproblems by ap- 
plying the new dual approach in this section. Section 
4 presents an application of our theory to graph se- 
lection problems. Experimental results on real graph 
learning problems can be found in Section 5. 

2. Preliminaries 

Basic definitions: We reserve lower-case and bold 
lower-case letters for scalar and vector representation, 
respectively. Upper-case bold letters denote matri- 
ces. Let vec: R^^^" W be the vcctorization op- 
erator which maps a matrix to a single column, and 
mat: W — )■ W^^ is the inverse mapping of vec 
which transforms a vector to a matrix. For a closed 
convex function /, we denote its domain by dom(/), 
dom(/) := {x e R" I f[x) < +oo}. 

Definition 2.1 (Self-concordant functions (Definition 
2.1.1, pp. 12, (Nesterov & Nemirovski, 1994)). A con- 
vex function /i : R — > R is (standard) self- concordant 
if \h"'ix)\ < 2/i"(a;)3/2, Vx € R. Furthermore, a func- 
tion h : M" -> R is self- concordant if, for any t G R, 
the function (j){t) := ft,(x-|-tv) is self- concordant for all 
X e dom(/) and v € R". 

Let h e C'^(dom(/)) be a strictly convex and self- 
concordant function. For a given vector v G M", 
the local norm around x G dom(/) with respect to 

1 /2 

h{-) is defined as |lv||^ := (v'^V^ft.(x)v) while 
the corresponding dual norm is given as ||v||^ := 

(v^V2/i(x)"iv)^^^ Let a; : R ^ R+ be a function 
defined as Ld{t) :~ < — ln(l + 1) and : [0, 1] M.^ be 
a function defined as Lj^(t) := — t — ln(l — i). The func- 
tions Lj and are both nonnegative, strictly convex 
and increasing. Based on (Nesterov, 2004) [Theorems 
4.1.7 & 4.1.8], we recall the following estimates: 

^(||y - x||J + S/h{^f{y - x) + /i(x) < hiy), (2) 
Ky) < h{^) + V/j(x)^(y - x) + c.,(||y - xj|J, (3) 

where (2) holds for all x, y G dom(/), and (3) holds 
for all X, y G dom(/) such that j|y — xj|^ < 1. 

Problem statement: In this paper, we consider the 
following structural convex optimization problem: 

min|j^(x) I F(x):-/(x)+.g(x)|, (4) 

where /(x) is a convex, self-concordant function and 
(/(x) is a proper, lower semicontinuous and possibly 
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nonsmooth convex regularization term. It is easy to 
certify that problem (1) can be transformed into (4) 
by using the tranformation x := vec(0): 

f(^\ J — log det(mat(x))+tr(Smat(x)), mat(x) ;^ 0, 
1 +00 otherwise, 

(7(x) := p j|x|| and n :^ p^. 

Proximity operator: A basic tool to handle nons- 
mooth convex functions is the proximity operator: let 
g he a proper lower semicontinuous, possibly nons- 
mooth and convex in M". We denote by dg{x) the 
subdiffcrcntial of g at x. Let / be a self-concordant 
function and x e dom(/) be fixed. We define P^{u) := 
(V2/(x) + dgy^iu) for u e M". This operator is a 
nonexpansive mapping, i.e., 

||Pg^(u)-Pg^(v)||^< ||u-v||;, Vu,v. (5) 

Unique solvability of the problem: We generalize 
the result in (Hsieh et al., 2011) to show that problem 
(4) is uniquely solvable. 

Lemma 2.2. For some x e dom(P), let A(x) ;= 
||V/(x) v||* < 1 /or V e 5g(x). Then the solution 
X* of (4) exists and is unique. 

The proof of this lemma can be done similarly as Theo- 
rem 4.1.11, pp. 187 in (Nesterov, 2004). For complete- 
ness, we provide it in the supplementary document. 

3. Two-phase proximal Newton method 

Our algorithmic framework is simply a proximal- 
Newton method which generates an iterative sequence 
{x'^I^^Q starting from x'' € dom(P). The new point 

x'^+^ is computed as x^-'^^ — x'" 4- afed*^, where ak € 
(0, 1] is a step size and d'^ is the proximal- Newton- type 
direction as the solution to the subproblem: 

min{Q(d;x'=) + .g(x*+d)}. (Q(x'^)) 

d 

Here, (3(d;x'^) is the following quadratic surrogate of 
the function / around x''': 

Qid;^'^) /(x^-) + V/(x^)^d + id^V2/(x'=)d. (6) 

We denote d'^ the unique solution of Q(x'^). The op- 
timality condition for Q(x'^) is written as follows: 

e dg{x'' + d'') + V/(x''0 + V^/(x'^)d^ (7) 

Fixed-point characterization. For given x £ 
dom(F), if we define S'(x) := V2/(x)x - V/(x) then 
the unique solution d*^ of Q{x'^) can be computed as 

d'^ := [Pf o S) (x") - x^ = -(I - i?,)(x'=). (8) 



Here, Rg{-) := {P^ o S) {■) = Pg{S{-)). The next 
lemma shows that the fixed point of Rg is the unique 
solution of (4). The proof is straightforward, and is 
omitted. 

Lemma 3.1. Let Rg be a mapping defined by (8). 
Then x* is the unique solution of (4) if and only if:x* 
is the fixed-point of Rg, i.e., x* = i?g(x*). 

Lemma 3.1 suggests that we can generate an iterative 
sequence based on the fixed-point principle. Under 
certain assumptions, one can ensure that Rg is con- 
tractive and the sequence generated by this scheme is 
convergent. Hence, we characterize this below. 

3.1. Full-step proximal-Newton scheme 

Here, we show that if we start sufficiently close to the 
solution X*, then we can compute the next iteration 
x'^+i with full-step ak+i ~ 1, i.e., 

^fc+i .= x'^- + d^ (9) 

where d'^ is the unique solution to Q(x'^). We call 
this scheme the full-step proximal Newton (FPN) 
scheme. For any fc > 0, let us define 

Afc:=||x^+i-x'=||^,. (10) 

We refer to this quantity as the proximal Newton decre- 
ment. The following theorem establishes the local 
quadratic convergence of the FPN scheme (9). 

Theorem 3.2. For a given x*^, let x*^"*"^ be the point 
generated by the full-step proximal Newton scheme (9) 
and Afc be defined by (10). Then, if < 1 — ~ 
0.292893, it holds that 

Xk+i<{l-4.Xk + 2Xl)-'Xl. (11) 

Consequently, the sequence {x''"}^^^ generated by the 
FPN scheme (9) starting from x" G dom{F) such that 
Xo < a < a -.^ ~ 0.219224, locally converges to 

the unique solution of (4) at a quadratic rate. 

The proof of Theorem 3.2 can be found in the supple- 
mentary document. 

3.2. Damped proximal Newton scheme 

We now establish that, with an appropriate choice of 
the step-size a € (0, 1], the iterative sequence {x'^}^,^^ 
generated by the damped proximal Newton scheme 

y.k+1 . _ ^ Jfe ^2) 

is a decreasing sequence, i.e., F(x'^+^) < F{x'') — uj{a) 
whenever Afc > tr, where cr > is fixed. First, we show 
the following property for the new iteration x'^^^. 
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Lemma 3.3. Suppose that x +^ is a point generated 
by (12). Then, we have 

F(x^+i) < F(x^-) - [akXl - Lo*iakXk)] , (13) 
provided that a^Xk < 1. 

Proof. Let y*"' = x*^ + d*^ , where d*^ is the unique solu- 
tion of Q(x*''). It follows from the optiniality condition 
of (7) that there exists G dg{y^) such that 

vfc = -V/(x^-) - V2/(x^)(y^ - x^). (14) 

Since / is self-concordant, by (3), for any x'^+^ such 
that Afc = ||x'^+-'- — x'"'|| ,, < 1 we have 

F(x'^+i) < F(x'^) + V/(x'^)^(x'^+i- x*^) (15) 
+ ..*(||x'=+i- x''^||^,)+.g(x'^+i)-.g(x'^). 
Since g is convex, a G [0, 1], by using (14) we have 

g(x^-+i) - g(x^-) = .g((l - a,)x'= + aky") - g(x'=) 
<a,[.9(y')-.9(x'^-)] 
<avny'=-x^) (16) 

^i=''-a,V/(x'=)^d'=-a,||d'=||;,. 

Now, substituting (16) into (15) and noting that 
y^k+i _ = Qffed'^ we obtain the following result 

F(x'^+i) < i^(x'^)+w*(a ||d'=^||^J-afc(d'=^)'^VV(x'=^)d'^' 

= F{^'^)~[akXl~u*{akXk)\, (17) 

which is indeed (13), provided that akXk < 1- □ 

The following theorem provides an explicit formula for 
the step size a^. 

Theorem 3.4. Let x'^'^^ he a new point generated by 
the scheme (12) and Xk be defined by (10). Then, if 
we choose ak := (1 + A^)^^ e (0, 1] then 

F(x^'+') <F(x^-)-^(Afc). (18) 
Moreover, the step ^ {1 + X^)^^ is optimal. 

Proof. By the choice of at, we have akXk = (1 + 
Afe)~^Afc < 1. By using the estimate (13) we have 

F(x^-+i) < F{^'^)-{l + Xkr'Xl+Lu* ((1 + A,.)-iAfc) . 

Since — = ^(t) f™' any ^ > 0, the last 

inequality implies (18). Finally, we note that the func- 
tion ip{a) :~ aX{l -|- A) -|- ln(l — aA) is maximized at 
ak = (1 + Afc)^^, showing that ak is optimal. □ 

Theorem 3.4 shows that the damped proximal Newton 
scheme generates a new point x'^+^ that decreases F of 
(4) at least cj(ct) at each iteration, whenever Afc > o". 



Quadratic convergence: Similar to the full-step 
proximal- Newton scheme (9), we can also show 
the quadratic convergence of the damped proximal- 
Newton scheme (12). This statement is summarized 
in the following theorem. 

Theorem 3.5. For a given x e dom(F), tei x'^+i be 
a new point generated by the scheme (12) with 
(1 + Afe)^^. Then, if Xk < 1 — it holds that 

Xk+i<2il-2Xk-Xl)-'Xl. (19) 

Hence, the sequence {x^^ generated by (12) with 

ttfe = (1 + Afe)^"'^ starting from x'^ G dom(i^) such that 
Ao<cr<a-:=V5-2« 0.236068 locally converges to 
X*, the unique solution of (4) at a quadratic rate. 

The proof of Theorem 3.5 can be found in the supple- 
mentary document. Note that the value a in Theorem 
3.5 is larger than in Theorem 3.2. However, both val- 
ues are not tight. 

3.3. The algorithm pseudocode 

As proved by Theorems 3.4 and 3.5, we can use the 
damped proximal-Newton scheme to build the algo- 
rithm. Now, we present a two-phase proximal-Newton 
algorithm. We first select a constant a G (0,(7]. At 
each iteration, we compute the new point x'^"''^ by us- 
ing the damped proximal Newton scheme (12) until 
we get Afc < (T. Then, we switch to the full-step New- 
ton scheme and perform it until the convergence is 
achieved. These steps are described in Algorithm 1. 

Note that the radius a of the quadratic convergence 
region in Algorithm 1 can be fixed at its upper bound 
a. The maximum number of iterations jmax and fcmax 
can also be specified, if necessary. 

3.4. Iteration-complexity analysis 

We analyze the complexity of Algorithm 1 by separat- 
ing Phase 1 and Phase 2. This analysis is summarized 
in the following theorem. 

Theorem 3.6. The maximum number of itera- 
tions required in Phase 1 does not exceed jmax 

^ \ ^' '^^s'^s X* is the unique solution of 
(4). The maximum number of iterations required in 
Phase 2 to obtain Xk < £ does not exceed fcmax '■= 
0(lnln(f)), where c:= {1 - 4a + 2a^)-'^ > 0. 

Proof. Since Aj > a for all j > in Phase 1, it 
follows from Theorem 3.4 that F(x-'+i) < F(x-') - 
w((t). By induction we have F{x*) < i^(x-''"">=) < 
F{x°) - jmaxw(CT). This implies that j,„ax < [-F(x°) - 



A proximal Newton framework for composite minimization 



Algorithm 1 (Proximal Newton algorithm) 
Initialization: 

Require a starting point x*^ G dom(i<") and a con- 
stant a e (0,ct], where a := w 0.219224. 
Phase 1: [Damped proximal Newton iterations). 

for j = to jinax do 

1. Compute the proximal- Newton search direc- 
tion d-' as the unique solution of Q(x^ ). 



2. Compute Xj 



3. if Xj < <T then terminate Phase 1. 

4. Otherwise, update the next iteration x-'^^ := 
x^ + ajd^ , where aj := (1 -I- Xj)^^ . 

end for 

Phase 2: [Full-step proximal Newton iterations). 



Set x" 



x-' from Phase 1 and choose a desired 



accuracy e > 0. 

for fc = to fc„iax do 

1. Compute the proximal-Newton direction d'^ as 
the unique solution of Q(x'^). 

2. Compute := ||d''■||^^^. 

3. if Afc < e then terminate Phase 2. 

4. Otherwise, update x'^^^ := x*"' + d*"'. 
end for 



F[x*)]/uj[a). Hence, we can fix 



F[yP)-F[x*) 
u}[a) 



Let c := (1 -4(7-1- 2ct^)^^ > 0. By induction, it follows 



< 



from Theorem 3.2 that we have Xk < (c) 
(c) a . In order to ensure A^ < e, we require 
(c)^ ~^cr^ < e, which leads to fc < O (lnln(c/e)). 
Hence, we can show that fcmax O (lnln(c/e)). □ 



We note that we do not use jmax as a stopping criterion 
of Phase 1 of Algorithm 1. In practice, we only need 
an upper bound of this quantity. If we fix ct at ct then 
c 4.561553 and the complexity of Phase 2 becomes 
0(lnln(^)). 

3.5. Dual solution approach of the subproblem 



In this subsection we consider a specific instance of 
g: (7(x) := /r?||x||j^. First, we derive a dual formula- 
tion of the convex subproblem Q(x'^). For notational 
convenience, we let qfc := V/(x'^), := V^/(x''). 
Then, the convex subproblem Q(x*'') can be written 
equivalently as 



min \ -y^Hky + (q;. - H,x'=)^y + p\\y\\,}. (20) 



By using the min-max principle, we can write (20) as 
max min i iy^Hfcy-F(q-Hfcx'')^y-F/9u'^y 

ll"lloo<iye"^" 

(21) 

Solving the inner minimization in (21) we obtain: 

„„r,,{^"H-u + q-u}, (22) 

where q^. := ^(Hj^^qfc — x'^). Note that the objective 
function iy9fe(u) := ^u'^H^^u-t-q^u of (22) is strongly 
convex. One can apply the fast projected gradient 
methods with linear convergence rate in (Nesterov, 
2007; Beck & Teboulle, 2009) for solving this problem. 

In order to recover the solution of the primal subprob- 
lem Q[x''), we note that the solution of the paramet- 
ric minimization problem in (21) is given by y^(u) := 
x'^ — Hjr^(q'^ -|- pu). Let u^, be the optimal solution 
of (22). We can recover the primal proximal-Newton 
search direction d*^ of the subproblem Q(x'^) as 



-VV(x^rMV/(x^-)+puy. 



(23) 



To compute the quantity A^. := Hd*^] 
1, we use (23) such that 



A, 



|V/(x^-)+pu^| 



in Algorithm 



(24) 



Note that computing Afc by (24) requires the inverse 
of the Hessian matrix V^/(x'^). 

4. Application to graph selection 

In this section, we customize the theory framework of 
Algorithm 1 by using only Phase 1 to solve the graph 
selection problem (1). 

Quantification: For clarity, we retain the matrix 
variable form as presented in (1). We note that /(©) 
is a self-concordant convex function, while g[&) is a 
proper, lower scmicontinuous and nonsmooth convex 
function. Thus, our theory presented above can be ap- 
plied to (1). Given the current estimate &i >~ 0, we 

have V/(0,) = 2-0-^ and V2/(0*) = 0^^® ©ro- 
under this setting, the dual subproblem (22) becomes: 

U* = argmin | itr((0,U) 

||vec(U)||^<l L ^ 



tr(Q,;U) , (25) 



where Qi := p^^[&iH&i — 20;]. Given the dual so- 
lution U* of (25), the primal proximal- Newton search 
direction (i.e. the solution of Q(x'^)) is computed as 



(26) 



A, := - (0,S - 1)0, + p0,U*0, 
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The quantity Ai defined in (24) can be computed by 

A, := (p - 2 • tr (W,) + tr (W^))'/' . (27) 
where W; := 0,(S + pU*). 

The graph learning algorithm: Algorithm 2 sum- 
marizes the proposed scheme for graph selection. 

Algorithm 2 (Dual PN for graph selection (DPNGS)) 
Input: Matrix E ;^ and a given tolerance e > 0. 
Output: An approximate solution 0^ of (1). 
Initialization: Find a starting point ©o >- 0. 
for z = to imax do 

1. Set Q, :=p-i (0,S0,-20,). 

2. Compute U* in (25). 

3. Compute K by (27), where := 0i(S+pU*). 

4. If Ai < e terminate. 

5. Compute A, — - - + p©,U*0i) . 

6. Set := (1 + A,)-^ 

7. Update 0i+i ;= 0^ + A.^. 
end for 



Overall, Algorithm 2 does not require any matrix in- 
version operation. It only needs matrix-vector and 
matrix-matrix calculations, making the parallelization 
of the code easier. We note that due to the predefined 
step-size selection Ui in Algorithm 1 we do not need to 
do any backtracking line-search step. This advantage 
can avoid some overhead computation regarding the 
evaluation of the objective function which is usually 
expensive in this application. 

Arithmetical complexity analysis: Since the ana- 
lytical complexity is provided in Theorem 3.6, we only 
analyze the arithmetical complexity of Algorithm 2 
here. As we work through the dual problem, the pri- 
mal solution is dense even if majority of the entries 
are rather small (e.g., smaller than 10~^).^ Hence, the 
arithmetical complexity of Algorithm 2 is dominated 
by the complexity oi p x p matrix multiplications. 

For instance, the computation of Qi and re- 
quire basic matrix multiplications. For the compu- 
tation of A;, we require two trace operations: tr(Wi) 
in 0{p) time-complexity and tr(Wf) in 0{p^) time- 
complexity. We note here that, while is a dense 
matrix, the trace operation requires only the compu- 
tation of the diagonal elements of Wf . Given 0; , ai 
and Ai, requires 0{p^) time-complexity. 

To compute (25), we can use the fast pro- 
jected gradient method (FPGM) (Nesterov, 2007; 

^In our MATLAB code, we made no attempts for spar- 
sification of the primal solution. The overall complexity of 
the algorithm can be improved via thresholding tricks. 



Beck & Teboulle, 2009) with step size 1/L where L 
is the Lipschitz constant of the gradient of the ob- 
jective function in (25). It is easy to observe that 
Li ■■= 7max(®») where 7niax(0i) is the largest eigen- 
value of 0;. For sparse 0,, we can approximately 
compute 7max(0i) is 0{p'^) by using iterative power 
methods (typically, 10 iterations suffice). The projec- 
tion onto ||vec(U)||^ < 1 clips the elements by unity 
in 0{p^) time. Thus, the time overhead due to accel- 
eration is within 0{p^). 

Given the above, FPGM requires a constant number of 
iterations fcmax, which is independent of the dimension 
p, to achieve an Ein solution accuracy. Overall, the 
time-complexity for the solution in (25) is ©(fcmaxA^), 
where M is the cost of matrix multiplication. 

Remark 4.1 (Parallel and distributed implementa- 
tion ability). In Algorithm 2, the outer loop does not 
require any Cholesky decomposition or matrix inver- 
sion. Suppose that the fast projected gradient method 
is applied to solve the dual suhproblem (25). The main 
operation needed in the whole algorithm is matrix- 
matrix multiplication of the form 0iU0i, where 0^ 
and U are symmetric positive definite. This operation 
can naturally he computed in a parallel or distributed 
manner. For more details, we refer the reader to Chap- 
ter 1 in (Bertsekas & Tsitsiklis, 1989). 

5. Numerical experiments 

In this section we test DPNGS (Algorithm 2 in Section 
4) and compare it with the state-of-the-art graph selec- 
tion algorithm Quadratic Inverse Covariance (QUIC) 
algorithm (Hsieh et al., 2011) on a real world data set. 
The QUIC algorithm is also a Newton-based method, 
which in addition exploits the sparsity in solving its 
primal subproblems. We note that QUIC was imple- 
mented in C while our codes in this work are imple- 
mented in MATLAB. 

Implementation details: We test DPNGS on MAT- 
LAB 2011b running on a PC Intel Xeon X5690 at 
3.47GHz per core with 94Gb RAM. To solve (25), 
we use the FPGM scheme as detailed in the sup- 
plementary material. We terminate FPGM if either 
||Ufe+i-Ufc||p < £inmax{||Ufc||p , 1} or the number 
of iterations reaches fci„ax where ein > and /cmax will 
be specified later. The stopping criterion of the outer 
loop is \i < 10^^ and the maximum number of outer 
iterations is chosen as i,nax := 200. We test the follow- 
ing three variants of DPNGS: DPNGS [ei„ = 10"*^ and 
km.. = 1000], DPNGS(5) [£;„ = lO'^ and = 5], 
and DPNGS(IO) [£i„ = 10"^ and fc^ax = 10]. The DP- 
NGS(5) and DPNGS(IO) variants can be considered as 
inexact variants of DPNGS. 
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Real-world data: In our experiments, we use the 
real biology data preproeessed by (Li & Toh. 2010) 
to compare the performance of the DPNGS vari- 
ants above and QUIC (Hsich et al.. 2011) for 5 prob- 
lems: Lymph {p = 587), Estrogen (p ~ 692), 
Arabidopsis {p ~ 834), Leukemia {p — 1225) and 
Hereditary (p = 1869). This dataset can be found at 
http : //ima . umn . edu/~maxxa007/send_SICS/. 

Convergence behaviour analysis: First, we verify 
the convergence behaviour of Algorithm 2 by analyzing 
the quadratic convergence of the quantity Ai, where 
\i is defined by (27). Our analysis is based on the 
Ljrmph problem with p = 587 variables. We note that 
\i reveals the weighted norm of the proximal-gradient 
mapping of the problem. The convergence behaviour 
is plotted in Figure 1 for three different values of p, 
namely p = 0.25, p = 0.1, p = 0.05 and p = 0.01. 
Figure 1 shows that whenever the values of A.; gets 




iteration 



Figure 1. Quadratic convergence of DPNGS 

into the quadratic region, it converges with only a few 
iterations. As p becomes smaller, we need more itera- 
tions to get into the quadratic convergence region. 

Next, we illustrate the step-size ai of DPNGS. Fig- 
ure 2 shows the increasing behaviour of the step size 
on the same dataset. Since = (1 -I- Xi)~^, it con- 
verges quickly at the last iterations. We also com- 
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Figure 2. The step size of DPNGS 
pare the objective value decrement of both algorithms 



in y- log-scale in Figure 3. Using the same tolerance 
level, we reach the objective value —4.141662 x 10^ 
after 69 iterations while QUIC needs 159 iterations. 
Moreover, Figure 3 shows the quadratic convergence 
of our approach in contrast to QUIC; the latter re- 
quires many more iterations to slightly improve the 
objective. Figure 4 is the histogram of the solution in 
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Figure 3. The difference of the objective values of DPNGS 
and QUIC in y-log-scale 

log scale reported by DPNGS and QUIC. Due to the 
dual solution approach, DPNGS reports an approxi- 
mate solution with similar sparsity pattern as the one 
of QUIC. However, our solution has many small num- 
bers instead of zero as in QUIC as revealed in Figure 
4. This seems to be the main weakness of the dual ap- 
proach: it obviates matrix inversions by avoiding the 
primal problem, which can return solutions with exact 
zeros thanks to its soft-thresholding prox-operator. 

As a result, DPNGS carries around extremely small co- 
efficients (almost of them smaller than 5 x 10~^) often 
preventing it from achieving the same objective level as 
the numerical experiments on the full data set shows. 
At the same time, since the approach does not rely 
on coordinate descent on active sets, it appears much 
less sensitive to the choice of p. This could be an ad- 
vantage of DPNGS in applications requiring smaller p 
values. If exact sparsity is needed, then a single primal 
iteration suffices to remove the small coefficients. 
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Figure 4. The histogram of the coefficient absolute values 
of the solution in log-scale of DPNGS and QUIC (right). 

Numerical experiments on the full dataset: We 

now report the numerical experiments on the biology 
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Table 1. Summary of comparison results on real world datasets. 
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Lymph Problem 


(p = 587) 










1 J 1 IN VjtO 


19 


49.028 


613.26 


29 


61.548 341.89 


40 


66.635 


133.60 


69 


104.259 


-414.17 




39 


7.470 


613.42 


34 


8.257 342.12 


43 


8.678 


133.87 


78 


35.543 


-413.82 
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61 


7.067 


615.87 


30 


4.323 344.72 


41 


5.862 


136.37 


69 


123.552 


-414.17 
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44 
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201 2103.788 


-414.17 
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(p = 692) 
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24 


141.027 


627.87 


39 


171.721 251.20 


52 


167.460 


-11.59 


83 


205.262 


-643.21 


DPNOSfin") 


56 


15.500 


628.10 


49 


14.092 251.52 


59 


19.262 


-11.25 


90 


28.930 


-642.85 
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39 


9.310 


631.53 


46 


8.388 254.61 


51 


9.332 


-7.69 


81 


42.955 


-639.54 




19 


7.060 


627.85 


43 


49.235 251.19 


81 


244.242 


-11.60 
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26 


174.947 


728.57 


43 


220.365 228.16 


61 


253.180 


-146.10 


100 


430.505 


-1086.57 




48 
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45 
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26.007 


-145.72 


200 


101.428 


-1038.60 




38 


9.826 


733.67 


44 


11.113 233.04 


57 


18.378 


-141.84 


95 


73.948 


-1083.53 
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19.684 


728.52 


49 


116.016 228.14 


95 


562.532 


-146.13 
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28 


669.548 1143.79 


48 


624.145 386.37 


71 


726.688 


-279.93 


130 1398.133 


-2071.33 


DPNGS(IO) 


65 


82.497 1144.66 


48 


60.108 387.26 


68 


84.017 


-279.12 


126 


166.567 


-2070.02 


DPNGS(5) 


49 


38.317 1154.13 


48 


37.273 395.08 


70 


50.886 


-271.01 


124 


258.090 


-2060.25 


QUIC (C code) 


18 


69.826 1143.76 


41 


344.199 386.33 


76 1385.577 


-280.07 
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DPNGS 


41 2645.875 1258.31 


82 3805.608 -348.49 


113 5445.974 - 


1609.59 


183 9020.237 


-4569.85 


DPNGS(IO) 


63 


242.528 1261.15 


80 


297.131 -345.47 


126 


435.159 - 


1606.67 


190 


732.802 


-4566.66 


DPNGS(5) 


58 


129.821 1290.34 


79 


169.817 -313.87 


126 


439.386 - 


1606.67 


179 1140.932 


-4537.95 


QUIC (C code) 


21 


437.252 1258.00 


45 1197.895 -348.80 


84 3182.211 - 


1609.92 









dataset and compare the methods. We test both algo- 
rithms with four different values of p, namely p — 0.25, 
p = 0.1, p = 0.05 and p = 0.01. The numerical re- 
sults are reported in Table 1. Since QUIC exceeds 
the maximum number of iterations imax = 200 and 
takes a long time, we do not report the results corre- 
sponding to p = 0.01. We note again that at the time 
of this ICML submission, our implementation is done 
in MATLAB, while QUIC code is (carefully!) imple- 
mented in C. Hence, our timings may improve. 

We highlight several interesting results from Table 1. 
First, QUIC obtains the highest accuracy results in 
most cases, which we attribute to the "lack of soft 
thresholding" in our algorithm. As the DPNGS algo- 
rithm carries around a score of extremely small num- 
bers (effectively making the solution dense in the nu- 
merical sense), its solutions are close to QUIC's solu- 
tions within numerical precision. Moreover, QUIC is 
extremely efficient when the p value is large, since it 
exploits the sparsity of the putative solutions via co- 
ordinate descent. Unsurprisingly, QUIC slows down 
significantly as p is decreased. 

DPGNS(5) and DPNGS(IO) can obtain near optimal 
solutions quite rapidly. In particular, DPNGS(IO) 
seems to be the most competitive across the board, of- 
ten taking a fraction of QUIC's time to provide a very 
close solution. Hence, one can expect these schemes 



to be used for initializing other algorithms. For in- 
stance, QUIC can be a good candidate. We observed 
in all cases that, in the first few iterations, QUIC per- 
forms several Cholesky decompositions to stay within 
the positive definite cone. As the complexity of such 
operation is large, our step-size selection within QUIC 
or a DPNGS(IO) initialization can be helpful. 

6. Conclusions 

In this paper, we present the new composite self- 
concordant optimization framework. As a concrete ap- 
plication, we demonstrate that graph learning is possi- 
ble without any Cholesky decompositions via analytic 
step-size selection as well as without matrix inversions 
via a careful dual formulation within our framework. 
By exploiting the self-concordance in the composite 
graph learning objective, we provide an optimal step- 
size for this class of composite minimization with prox- 
imal Newton methods. We show that within the dual 
formulation of the Newton subproblem, we do not need 
to explicitly calculate the gradient as it appears in 
a multiplication form with the Hessian. Thanks to 
the special structure of this multiplication, we avoid 
matrix inversions in graph learning. Overall, we ex- 
pect our optimization framework to have more appli- 
cations in signal processing/machine learning and be 
amenable to various parallelization techniques, beyond 
the ones considered in the graph learning problem. 
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Supplementary material 



A. The proofs of technical statements 
A.l. The proof of Theorem 3.2 

Proof. Let x*"' e dom(F), we define 

Pi:= (VV(x'-)+%r\ 
Sk{z) V2/(x'')z - V/(z). 

and 

efc = efe(x) := [V^/lx"') - V'fi^)]d''). 

It follows from the optimality condition (7) in the main 
text that 

e %(x^'+i) + V/Cx'^) + V2/(x'^')(x'^'+' - x'^'). 
This condition can be written equivalently to 

Ski^'') + efc(x^) e VV(x'^)x"+i + ag(x'^+^). 
Therefore, the last relation leads to 

x^+i =P|(5fc(x'^)+efe). 
If we define d*^ x'''+^ — x*^ then 

d^- = P|(5fc(x^-) + efc)-x'=. 
Consequently, we also have 



(28) 



dfe+i = F,^(5fc(x^+i) + efe+i) - x'^+i. 



(29) 



We consider the norm A^. := 



d" 



fc+i 



By using the 



nonexpansive property of P^, it follows from (28) and 
(29) that 

= \\P^ (5.(x''^+i) + e,+i) - P^, (5fe(x'=) + e,) 

< ||5fc(x^-+i) + Bk+i - Ski^'') - efell*, 

< II V/(x''^+i) - Wf{^') VV(x'^-)(x''-+i - x^)||;, 
+ ||efc+i - efe||*fc 

/'[VV(x^) - V2/(x^)](x^-+i - x'=)dr 
Jo 



efc+i - efc||x&Jr2i 



[1] 
(30) 



where xj^ := x*^' + r(x'''+-'^ — x*^). First, we estimate the 
first term in the last line of (30) which we denote by 
[•][!]. Now, we define 



and 



[V'-fi^" + r(x' 



fe+i 



X'-O) - V2/(x^-)]dT, 



Nfe V2/(x")-i/'MfcVV(x")- 



1/2 



Similar to the proof of Theorem 4.1.14 in (Nes- 
terov, 2004), we can show that ||Nfc|| < (1 — 

)-i d*" . Combining this inequality and 

(30) we deduce 



= ||Mfed'=||^,<||Nfe| 



(31) 



Next, we estimate the second term of (30) which is 
denoted by [•][2]. We note that e^ = efc(x'^) = and 

e,+i = efc+i(x^+i) = [V2/(x'=) - v2/(x'^'+^)]d^+\ 
Let 

Pfe := V2/(x^-)-'^'[V'/(x"+^) - V2/(x"-)]V^/(x^)-i/^ 

By applying Theorem 4.1.6 in (Nesterov, 2004), we can 
estimate ||Pfc|| as 



|Pfe|| < max<^ 1 - (1 



2Xk - XI 



(1 



Y 



(32) 



(1-A..)2- 

Therefore, from the definition of [•][2] we have 

[■%] = [l|efc+i - efc||*fc]^ 

= (e,.+i - efc)^V2/(x'=)-i(efc+i - e^) 
= (d"+i)^V2/(x")^/'P?-V2/(x'y/'d^-+i 
< ||Pfcin|d^+i||;,. (33) 

By substituting (32) into (33) we obtain 



2Afc - A2 



Substituting (31) and (34) into (30) we obtain 



(34) 



\i ^ 2Afe-A| 1 
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By rearrange this inequality we obtain 

1 - Afc 



Al.< 



1-4A, 



2A| 



Al 



(35) 



On the other hand, by applying Theorem 4.1.6 in (Nes- 
terov, 2004), we can easily show that 



A 



k+l 



jfe+1 



< 



ik+1 



1 



1- Afc 



(36) 



Combining (35) and (36) we obtain 
Afc+i 



< 



A?. 



1 



4Afc + 2X1 ' 



which is (11) in the main text. Finally, we consider the 
sequence {x''}^^^^ generated by (9) in the main text. 
From (11) in the main text, we have 

Ai < (1-4Ao + 2A2)-ia2 

< (1 -4cr + 2cr2)-lcr2 

< (J 

provided that < cr < « 0.219224. By induc- 

tion, we can conclude that A^ < /? for all /c > 0. It 
follows from (11) in the main text that 



< {1 - Aa + 2a^y^ Xl 



for all k, which shows that | ||x'' — x* } converges 
to zero at a quadratic rate. □ 

A.2. The proof of Theorem 3.5 

Proof. First, we note that 



= x^+afcd'^=x'' + (l + Afe)-'x' 



Hence, we can estimate d'^^^ as 
= (1 + Afe) 



A 



k+l 



< 



1 - akXk 



jfc+i 



(37) 



By a similar approach as the proof of Theorem 3.5, we 
can estimate ||d|L t as 



2A? 



(l + Afc)(l-2Afc-A2)- 



Combining this inequality and (37) we obtain (19) in 
the main text. 

In order to prove the quadratic convergence, we first 
show that if Afc < cr then Afe+i < a for all fc > 0. 
Indeed, we note that the function: 

ip{t) -.^ 2t{l - 2t - t^)-^ 



is increasing in [0,1 — 1/V2]. Let Ao < a. From (19) 
we have: 



Therefore, if 



Ai < 20-2(1 _ 2a -a^). 



2(7^(1 - 2(7- Cr^) < (T, 



then Ai < cr. The last requirement leads to < cr < 
a := \/5 — 2 « 0.236068. From this argument, we 
conclude that if cr e (0, fr] then if Aq < cr then Ai < a. 
By induction, we have Afc < cr for fc > 0. If we define 

c := 2{1 - 2a - (T^)-^ 

then c > and (19) implies Afc+i < cA^ which shows 
that the sequence {Afc}fe>o locally converges to at a 
quadratic rate. □ 

A. 3. The proof of Lemma 2.2. 

Proof. From the self-concordance of / we have: 

^(l|y - x||J + /(x) + V/(x)^(y - x) < /(y). 

On the other hand, since g is convex we have 

g(y) >5(x)+v^(y-x) 

for any v S dg{x.). Hence, 

F(y) > F(x) + [V/(x) + v]^(y - x) + u:{\\y - x|l J 
>F(x)-A(x)||y-x||^ + c.(||y-xl|J, 

where A(x) := || V/(x) + vj]* . Let: 

CpiF{^)) := {y e M" | i^(y) < F(x)} 

be a sublevel set of F. For any y E £i?(F(x)) we have 
Fiy) < -P(x) which leads to: 

A(x)||y-x||,>c^(||y-x||J 

due to the previous inequality. Note that w is a convex 
and strictly increasing, the equation A(x)i = Lu{t) has 
unique solution t > if A(x) < 1. Therefore, for 
any < t < i we have ||y — x||^ < t. This implies 
that Cf{F{x)) is bounded. Hence, x* exists. The 
uniqueness of x* follows from the increase of w. □ 

B. A fast projected gradient algorithm 

For completeness, we provide here a variant of the 
fast-projected gradient method for solving the dual 
subproblem (25) in the main text. Let us recall that 
clip^(X) := sign(X) min{|X|, r} (a point-wise opera- 
tor). The algorithm is presented as follows. 
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Algorithm 3 {Fast-projected- gradient algorithm) 
Input: The current iteration 0^ and a given tol- 
erance Ein > 0. 

Output: An approximate solution Ufc of (25) in 
the main text. 

Initialization: Compute a Lipschitz constant L 
and find a starting point Uq >- 0. 
Set Vo := Uo, to := 1- 

for fc = to fcmax do 

1. V,+i:=clipi (Ufc-i [0,(Ufc + iE)0,-^0, 

2. If||Vfc+i - Vfcll 

Fro - einmax{l, llVfellp^.^} then 

terminate. 

3. tk+i 0.5(1 + v/T+4if) and /3fc f^. 

4. Ufc+i := Vfe+i + /3fc(Vfe+i - Vfc). 
end for 



The main operator in Algorithm 3 is 0iUfe0i at Step 
2, where 0^ and U^. arc symmetric and 0; may be 
sparse. This operator requires twice matrix-matrix 
multiplications. The worst-case complexity of Algo- 
rithm 3 is typically O which is sublinear. If 
/.t = Amin(0i), the smallest eigenvalue of 0^, is avail- 
able, we can set /3fc ^ ^ and wc get a linear 
convergence rate. 



